
    Speaker diarization assisted ASR for multi-speaker conversations

    In this paper, we propose a novel approach for the transcription of speech conversations with natural speaker overlap from single-channel recordings. We propose a combination of a speaker diarization system and a hybrid automatic speech recognition (ASR) system with a speaker-activity-assisted acoustic model (AM). An end-to-end neural network system is used for speaker diarization. Two architectures, (i) an input-conditioned AM and (ii) a gated-features AM, are explored to incorporate the speaker activity information. The models output speaker-specific senones. Experiments on Switchboard telephone conversations show the advantage of incorporating speaker activity information in the ASR system for recordings with overlapped speech. In particular, an absolute improvement of 11% in word error rate (WER) is seen for the proposed approach on natural conversational speech with automatic diarization. Comment: Manuscript submitted to INTERSPEECH 202
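    The gated-features idea can be sketched in a few lines: the diarizer's per-speaker activity posteriors gate the frame-level acoustic features into per-speaker streams before they reach the acoustic model. This is a minimal illustration, not the paper's implementation; the function name `gate_features`, the tensor shapes, and the normalization of overlapping frames are all assumptions.

```python
import numpy as np

def gate_features(feats, activity, eps=1e-8):
    """Gate acoustic features with per-speaker activity posteriors.

    feats:    (T, D) frame-level acoustic features.
    activity: (T, S) speaker activity posteriors from the diarizer.
    Returns a (T, S, D) tensor of speaker-specific feature streams.
    """
    # Normalize activities so overlapped frames are shared between speakers.
    act = activity / (activity.sum(axis=1, keepdims=True) + eps)
    return act[:, :, None] * feats[:, None, :]

# Toy example: 4 frames, 3-dim features, 2 speakers; last 2 frames overlap.
T, D, S = 4, 3, 2
feats = np.ones((T, D))
activity = np.array([[1.0, 0.0]] * 2 + [[0.5, 0.5]] * 2)
gated = gate_features(feats, activity)
print(gated.shape)  # (4, 2, 3)
```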

    Coswara -- A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis

    The COVID-19 pandemic presents global challenges transcending boundaries of country, race, religion, and economy. The current gold-standard method for COVID-19 detection is reverse transcription polymerase chain reaction (RT-PCR) testing. However, this method is expensive, time-consuming, and incompatible with social distancing. Also, as the pandemic is expected to stay for a while, there is a need for an alternative diagnostic tool that overcomes these limitations and is deployable at large scale. The prominent symptoms of COVID-19 include cough and breathing difficulty. We foresee that respiratory sounds, when analyzed using machine learning techniques, can provide useful insights, enabling the design of a diagnostic tool. Towards this, the paper presents an early effort in creating (and analyzing) a database, called Coswara, of respiratory sounds, namely cough, breath, and voice. The sound samples are collected via worldwide crowdsourcing using a web application. The curated dataset is released as open access. As the pandemic is evolving, the data collection and analysis are a work in progress. We believe that insights from the analysis of Coswara can be effective in enabling sound-based technology solutions for point-of-care diagnosis of respiratory infection, and in the near future this can help diagnose COVID-19. Comment: A description of the Coswara dataset to evaluate COVID-19 diagnosis using respiratory sounds

    Late Reverberation Cancellation Using Bayesian Estimation of Multi-Channel Linear Predictors and Student's t-Source Prior

    Multi-channel linear prediction (MCLP) can model the late reverberation in the short-time Fourier transform domain using a delayed linear predictor, and the prediction residual is taken as the desired early-reflection component. Traditionally, a Gaussian source model with time-dependent precision (inverse of variance) is considered for the desired signal. In this paper, we propose a Student's t-distribution model for the desired signal, realized as a Gaussian source with a Gamma-distributed precision. Further, since the choice of a proper MCLP order is critical, we also incorporate a Gaussian prior for the prediction coefficients together with a deliberately higher model order. We consider a batch estimation scenario and develop a variational Bayes expectation maximization (VBEM) algorithm for joint posterior inference and hyper-parameter estimation. This leads to more accurate and robust estimation of the late reverberation component and hence its cancellation, benefiting the estimation of the desired residual signal. Along with these stochastic models, we formulate multi-input single-output (MISO) and multi-input multi-output (MIMO) schemes using shared priors for the desired signal precision and the estimated MCLP coefficients at each microphone. Experiments using real room impulse responses show improved late reverberation suppression with the proposed VBEM approach over the traditional methods for different room conditions. Additionally, we achieve a sparse coefficient vector for the MCLP, avoiding the criticality of manually choosing the model order. The MIMO formulation is easily extended to include spatial filtering of the enhanced signals, which further improves the estimation of the desired signal.
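    The delayed-predictor core of MCLP can be sketched per frequency band. The sketch below implements the classic Gaussian-source baseline with an iteratively re-estimated time-varying variance (WPE-style), which is the starting point the paper extends with a Student's t source; the function name `wpe_band`, the order/delay defaults, and the choice of predicting only channel 0 (MISO) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def wpe_band(X, order=4, delay=3, iters=3, eps=1e-8):
    """Delayed multi-channel linear prediction for one frequency band.

    X: (M, T) complex STFT coefficients of M microphones.
    Returns the (T,) early (residual) component of channel 0.
    """
    M, T = X.shape
    D = X[0]
    # Stack delayed copies of every channel: rows are lags delay..delay+order-1.
    rows = []
    for m in range(M):
        for k in range(order):
            tau = delay + k
            shifted = np.zeros(T, dtype=complex)
            shifted[tau:] = X[m, :T - tau]
            rows.append(shifted)
    Y = np.array(rows)                       # (M*order, T)
    d = D.copy()
    for _ in range(iters):
        lam = np.abs(d) ** 2 + eps           # time-varying variance estimate
        Yw = Y / lam                         # precision-weighted predictors
        R = Yw @ Y.conj().T                  # weighted covariance
        p = Yw @ D.conj()
        g = np.linalg.solve(R + eps * np.eye(M * order), p)
        d = D - g.conj() @ Y                 # desired early component
    return d

# Toy single-channel band: white source plus an artificial tail at lag 3.
rng = np.random.default_rng(0)
T = 400
s = rng.standard_normal(T) + 1j * rng.standard_normal(T)
x = s.copy()
x[3:] += 0.9 * s[:-3]
d = wpe_band(x[None, :])
```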

    Time-Varying Linear Prediction Using Sparsity Constraints

    Time-varying linear prediction has been studied in the context of speech signals, in which the auto-regressive (AR) coefficients of the system function are modeled as a linear combination of a set of known bases. Traditionally, least-squares minimization is used for the estimation of the model parameters of the system. Motivated by the sparse nature of the excitation signal for voiced sounds, we explore time-varying linear prediction modeling of speech signals using sparsity constraints. Parameter estimation is posed as an ℓ0-norm minimization problem, and the re-weighted ℓ1-norm minimization technique is used to estimate the model parameters. We show that, for sparsely excited time-varying systems, this formulation models the underlying system function better than the least-squares error minimization approach. Evaluation with synthetic and real speech examples shows that the estimated model parameters track the formant trajectories more closely than the least-squares approach.
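    The basis-expanded regression behind TVLP is easy to sketch: each AR coefficient a_p(n) is a linear combination of known bases, so the design matrix has one column per (lag, basis) pair. The sketch below replaces the paper's re-weighted ℓ1 scheme with a common iteratively reweighted least-squares surrogate for the ℓ1 residual norm; `tvlp_sparse`, the polynomial bases, and the epsilon smoothing are assumptions for illustration.

```python
import numpy as np

def tvlp_sparse(s, order=2, n_basis=3, iters=5, eps=1e-3):
    """Time-varying LP with a sparse-residual objective (IRLS surrogate).

    Residual model: e[n] = s[n] + sum_p a_p(n) s[n-p], with
    a_p(n) expanded in polynomial bases. Returns (coeffs, residual).
    """
    N = len(s)
    t = np.linspace(-1, 1, N)
    phi = np.vstack([t ** k for k in range(n_basis)])       # (K, N)
    cols = []
    for p in range(1, order + 1):
        lagged = np.concatenate([np.zeros(p), s[:-p]])
        for k in range(n_basis):
            cols.append(lagged * phi[k])                     # column (p, k)
    A = np.array(cols).T                                     # (N, P*K)
    w = np.ones(N)
    for _ in range(iters):
        Aw = A * w[:, None]
        c = np.linalg.lstsq(Aw, -w * s, rcond=None)[0]       # weighted LS
        e = s + A @ c                                        # residual
        w = 1.0 / np.sqrt(np.abs(e) + eps)                   # ell-1 reweight
    return c, e

# Synthetic AR(1) driven by sparse impulses every 20 samples.
N = 200
u = np.zeros(N)
u[::20] = 1.0
s = np.zeros(N)
for n in range(N):
    s[n] = (0.9 * s[n - 1] if n else 0.0) + u[n]
c, e = tvlp_sparse(s, order=1, n_basis=1)
```

For this stationary toy system the single recovered coefficient should sit near the true value -0.9, and the residual should concentrate on the impulse locations.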

    Joint Bayesian Estimation of Time-Varying LP Parameters and Excitation for Speech

    We consider the joint estimation of time-varying linear prediction (TVLP) filter coefficients and the excitation signal parameters for the analysis of long-term speech segments. Traditional approaches to TVLP estimation only assume a linear expansion of the coefficients in a set of known basis functions. However, the excitation signal is also time-varying, which affects the estimation of the TVLP filter parameters. In this letter, we propose a Bayesian approach to incorporate the nature of the excitation signal and also to adapt the regularization of the filter parameters. Since the order of the system is not known a priori, we formulate a Gaussian prior for the filter parameters, and the excitation signal is modeled as Gaussian with a time-varying, Gamma-distributed precision. We develop an iterative algorithm for the maximum-likelihood estimation of the posterior distribution of the filter parameters and the time-varying precision of the excitation signal, along with the parameters of the prior distribution. We show that the proposed method adapts to different types of excitation signals in speech, as well as to time-varying systems with unknown model order. The spectral modeling performance for synthetic speech-like signals, quantified using the absolute spectral difference, shows that the proposed method estimates the system function more accurately than several of the traditional methods.
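    The alternation described here, a Gaussian (ridge-like) posterior for the filter coefficients interleaved with a Gamma-style update of the time-varying excitation precision, can be sketched schematically. This is only the structure of such an iteration, not the paper's VB derivation; `bayes_tvlp` and the hyperparameters `a0`, `b0`, `alpha` are illustrative assumptions.

```python
import numpy as np

def bayes_tvlp(A, s, iters=10, a0=1e-2, b0=1e-2, alpha=1.0):
    """Schematic Bayesian alternation for TVLP.

    A: (N, Q) basis-expanded lag matrix; s: (N,) signal, so the
    residual is e = s + A @ c. Alternates a Gaussian-prior (ridge)
    mean estimate of c with a time-varying precision update.
    Returns (coeffs, residual, precisions).
    """
    N, Q = A.shape
    lam = np.ones(N)
    for _ in range(iters):
        Aw = A * lam[:, None]
        R = Aw.T @ A + alpha * np.eye(Q)        # Gaussian prior on c
        c = np.linalg.solve(R, -Aw.T @ s)       # posterior mean
        e = s + A @ c
        lam = (a0 + 0.5) / (b0 + 0.5 * e ** 2)  # Gamma-style precision
    return c, e, lam

# Same sparse-excitation AR(1) toy signal as above.
N = 200
u = np.zeros(N)
u[::20] = 1.0
s = np.zeros(N)
for n in range(N):
    s[n] = (0.9 * s[n - 1] if n else 0.0) + u[n]
A = np.concatenate([[0.0], s[:-1]])[:, None]
c, e, lam = bayes_tvlp(A, s)
```

The estimated precision should drop at impulse instants (large excitation) and rise elsewhere, which is exactly what lets the coefficient estimate ignore the impulses.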

    Linear Prediction Based Diffuse Signal Estimation for Blind Microphone Geometry Calibration

    The spatial cross-coherence function between two locations in a diffuse sound field is a function of the distance between them. Earlier approaches to microphone geometry calibration utilizing this property assume the presence of an ambient noise source. Instead, we consider geometry estimation using a single acoustic source (not noise) and show that late reverberation (diffuse signal) estimation using multi-channel linear prediction (MCLP) provides a computationally efficient solution to geometry estimation. The idea is that the component of a reverberant signal corresponding to late reflections satisfies the diffuse sound field properties, which we exploit for distance estimation between microphone pairs. MCLP of the short-time Fourier transform (STFT) coefficients is used to decompose each microphone signal into early and late reflection components. The cross coherence computed between the separated late reflection components is then used for pair-wise microphone distance estimation. Multidimensional scaling (MDS) is then used to estimate the geometry of the microphones from the pair-wise distance measurements. We show that higher reverberation, though detrimental to signal estimation, can aid microphone geometry estimation. A position error of less than 2 cm is achieved using the proposed approach for real microphone-recorded signals.
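    The two geometric ingredients of this pipeline can be sketched directly: an ideal diffuse field has a sinc-shaped coherence versus frequency for a given microphone spacing, and classical MDS recovers coordinates from pair-wise distances. The fitting routine `estimate_distance` (a simple grid search) and its search range are assumptions for illustration, not the paper's estimator.

```python
import numpy as np

def diffuse_coherence(f, d, c=343.0):
    """Spatial coherence of an ideal diffuse field for mic spacing d (m)."""
    x = 2.0 * np.pi * f * d / c
    return np.sinc(x / np.pi)            # np.sinc(t) = sin(pi*t)/(pi*t)

def estimate_distance(freqs, coh, c=343.0):
    """Grid-search fit of the sinc model to a measured coherence curve."""
    grid = np.linspace(0.01, 0.5, 500)   # candidate spacings in meters
    errs = [np.mean((coh - diffuse_coherence(freqs, d, c)) ** 2) for d in grid]
    return grid[int(np.argmin(errs))]

def classical_mds(D, dim=2):
    """Recover coordinates (up to rotation/translation) from pair-wise
    distances D via classical multidimensional scaling."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J          # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Ideal coherence curve for a 10 cm pair, then recover the spacing.
freqs = np.linspace(100.0, 4000.0, 60)
coh = diffuse_coherence(freqs, 0.1)
d_hat = estimate_distance(freqs, coh)
```

In practice the coherence is measured between the MCLP-separated late components; with noiseless model coherence the grid search recovers the spacing almost exactly, and MDS reproduces the array layout from the distance matrix.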